Abstract

We describe a method of using statistically-collected Chinese character groups from a corpus 
to augment a Chinese dictionary. The method is particularly useful for extracting domain-specific
and regional words not readily available in machine-readable dictionaries. Output
was evaluated both using human evaluators and against a previously available dictionary. 
We also evaluated performance improvement in automatic Chinese tokenization. Results 
show that our method outputs legitimate words, acronymic constructions, idioms, names 
and titles, as well as technical compounds, many of which were lacking from the original 
dictionary. 



1 Introduction 



Finding new lexical entries for Chinese is hampered by a particularly obscure distinction 
between characters, morphemes, words, and compounds. Even in Indo-European text where 
words can be separated by spaces, no absolute criteria are known for deciding whether a 
collocation constitutes a compound word. Chinese defies such distinctions yet more strongly. 
Characters in Chinese (and in some other Asian languages such as Japanese and Korean) 
are not separated by spaces to delimit words; nor do characters give morphological hints as
to word boundaries. Each single character carries a meaning and can be ambiguous; most
are many-way polysemous or homonymous.

Some characteristics of Chinese words are nonetheless clear. A word in Chinese is usually 
a bigram (two-character word), a unigram, a trigram, or a 4-gram. Function words are often
unigrams, and n-grams with n > 4 usually are specific idioms. According to the Frequency
Dictionary of Modern Chinese (FDMC 1986), among the top 9000 most frequent words,
26.7% are unigrams, 69.8% are bigrams, 2.7% are trigrams, 0.007% 4-grams, and 0.0002%
5-grams. Another study (Liu 1987) showed that in general, 75% of Chinese words are
bigrams, 14% trigrams, and 6% n-grams with n > 3.

Inadequate dictionaries have become the major bottleneck to Chinese natural language 
processing. Broad coverage is even more essential than with Indo-European languages, 
because not even the most basic lexicosyntactic analysis can proceed without first identifying 
the word boundaries. Thus a significant number of models for tokenizing or segmenting 
Chinese have recently been proposed, using either rule-based or statistical methods (Chiang 
et al. 1992; Lin et al. 1992; Chang & Chen 1993; Lin et al. 1993; Wu & Tseng 1993;
Sproat et al. 1994). But all of these approaches rely primarily upon dictionary lookup of 
the potential segments; in spite of experimental heuristics for handling unknown words in 
the input text, accuracy is seriously degraded when dictionary entries are missing. 

Tokenization problems are aggravated by text in specialized domains. Such documents 
typically contain a high percentage of technical or regional terms that are not found in the 
tokenizer's dictionary (machine-readable Chinese dictionaries for specialized domains are 
not readily available). Most effective tokenizers have domain-specific words added manually 
to the dictionary. Such manual strategies are too tedious and inefficient in general. 

This paper discusses a fully automatic statistical tool that extracts words from an untokenized
Chinese text, creating new dictionary entries. In addition, it is desirable to
identify regional and domain-specific technical terms that are likely to appear repeatedly in
a large corpus. We extended and re-targeted Xtract, a tool originally designed for extracting
English compounds and collocations, to find words in Chinese. We call the resulting
tool CXtract. Words found by CXtract are used to augment our dictionary.

In the following sections, we first describe the modifications in CXtract for finding 
Chinese words, and the corpus used for training. The resulting words and collocations are 
evaluated by human evaluators, and recall and precision are measured against the tokens in 
the training set. The significance of evaluated results will be discussed. Finally, we discuss 
a preliminary evaluation of the improvement in tokenization performance arising from the
output of our tool.



2 A Collocation Extraction Tool 

Xtract was originally developed by Smadja (1993) to extract collocations in an English text. 
It consists of a package of software tools used to find likely co-occurring word groups by 
statistical analysis. 

In the first stage of Xtract, all frequent bigrams are found. The two words of a bigram are
permitted to occur within a window of 10 positions, specifically, at distances between -5 and
5 relative to each other. A threshold is set on the frequency to discard unreliable bigrams.
The remaining bigrams constitute part of the output from Xtract (along with the output 
from the second stage). 
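The first stage can be pictured with a short sketch. It is not Smadja's implementation: it only counts, for every ordered word pair, how often the second word falls at each relative distance within a window of 5 positions on either side, and then applies a frequency cut-off. The function names, the toy sentence, and the `min_freq` value are illustrative assumptions.

```python
from collections import Counter, defaultdict

def windowed_pairs(tokens, max_dist=5):
    """Histogram of relative distances for each ordered pair within +/- max_dist."""
    counts = defaultdict(Counter)
    for i, w1 in enumerate(tokens):
        lo = max(0, i - max_dist)
        hi = min(len(tokens), i + max_dist + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w1, tokens[j])][j - i] += 1
    return counts

def frequent_pairs(counts, min_freq):
    """Keep pairs whose total count over all distances reaches min_freq."""
    return {pair: hist for pair, hist in counts.items()
            if sum(hist.values()) >= min_freq}

# Toy input (hypothetical); min_freq=2 is chosen only to suit its size.
tokens = "the stock market fell as the stock market closed".split()
pairs = frequent_pairs(windowed_pairs(tokens), min_freq=2)
print(dict(pairs[("stock", "market")]))
```

A pair whose histogram is concentrated at distance +1, as "stock market" is here, corresponds to the adjacent case discussed in this section.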

The second stage looks at a tagged corpus to find collocations involving more
than two words (up to ten), using the bigrams found in the first stage as anchors.
Again, a frequency threshold is set to discard unreliable collocations.

Xtract's output consists of two types of collocations. In the simpler case, a collocation is 
an adjacent word sequence such as "stock market" (extracted from the Wall Street Journal). 
More general collocations permit flexible distances between two word groups, as in "make
a ... decision".

For our purpose, we were interested in looking for adjacent character groups without 
distances between the groups. We postulated that, just as "stock market" could be regarded 
as a compound word, we would discover that frequently appearing continuous character 
groups are likely to be words in Chinese. We were also interested in looking for multi-word 
collocations in Chinese since these would presumably give us many technical and regional 
terms. 

Because Xtract was originally developed for English, many capabilities for handling non-alphabetic
languages were lacking. We extended Xtract to process character-based Chinese
texts without tags. Various stages of the software were also modified to deal with untagged
texts.

Other parametric modifications arose from the difference between the distribution of
characters that make up Chinese words and that of the words that make up English compounds.
For example, the frequency threshold for finding reliable bigrams is different
because CXtract returns far more Chinese character bigrams than Xtract returns English
word bigrams.



3 Experiment I: Dictionary Augmentation 



Our experiments were aimed at determining whether our statistically-generated output 
contains legitimate words. We used text from (the Chinese part of) the HKUST
English-Chinese Parallel Bilingual Corpus (Wu 1994), specifically, transcriptions of the parliamentary
proceedings of the Legislative Council. The transcribed Chinese is formalized
literary Cantonese that is closer to Mandarin than conversational Cantonese. However, more 
vocabulary is preserved from classical literary Chinese than in Mandarin, which affects the 
ratio of bigrams to other words. 

Evaluation of legitimate Chinese words is not trivial. It is straightforward to evaluate 
those outputs that can be found in a machine-readable dictionary such as the one used 
by the tokenizer. However, for unknown words, the only evaluation criterion is human 
judgement. We evaluated the output lexical items from CXtract by both methods. 

3.1 Procedure 

For Experiment I, we used a portion of the corpus containing about 585,000 untokenized 
Chinese characters (which turned out to hold about 400 thousand Chinese words after 
tokenization). The experiment was carried out as follows: 

1: A dictionary of all unigrams of characters found in the text was composed. One
example is the character 立 (li), which can mean "stand" or "establish" by itself.

2: From the unigram list, we found all the bigrams associated with each unigram and 
obtained a list of all bigrams found. 

3: We kept only bigrams which occur significantly more than chance expectation, and
which appear in a rigid way (Smadja 1993). This yields a list of possible bigrams and
the most frequent relative distance between the two characters of each. The distances are kept
between -5 and 5 as in Xtract since this ultimately gives collocations of lengths up to
10, which we found sufficient for Chinese.

4: From this bigram list, we extracted only those bigrams in which the two characters
occur adjacently. We assumed such bigrams to be Chinese words. For 立 (li), one
output bigram was 立法 (li fa), which means "legislative", a legitimate word.

5: Using all bigrams (adjacent and non-adjacent) from (3), we extracted words and 
collocations of lengths greater than two. Outputs with frequency less than 8 were 
discarded. 



6: We divided the output from (5) into lists of trigrams, 4-grams, 5-grams, 6-grams, and
m-grams where m > 6. One of the trigrams, for example, is 立法局 (li fa ju), which
means "Legislative Council" and is another legitimate word.
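The bookkeeping in steps (4) to (6) can be approximated with a brute-force sketch: count every contiguous character n-gram, discard those seen fewer than 8 times, and group the survivors by length. This is a simplified stand-in, not the anchor-based expansion that CXtract inherits from Xtract, and the function names and toy text are assumptions made for illustration.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count all contiguous character n-grams of length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def candidates_by_length(text, min_freq=8, max_len=10):
    """Group n-grams (2 <= n <= max_len) with count >= min_freq by their length."""
    buckets = {}
    for n in range(2, max_len + 1):
        kept = {g: c for g, c in ngram_counts(text, n).items() if c >= min_freq}
        if kept:
            buckets[n] = kept
    return buckets

# Toy corpus (hypothetical): one phrase repeated 8 times plus a rare bigram.
text = "立法局會議" * 8 + "香港"
buckets = candidates_by_length(text, min_freq=8)
print(sorted(buckets))   # n-gram lengths that survived the threshold
print(buckets[3])
```

Note that a plain frequency cut-off also lets through fragments such as 法局會, so it does not replace the statistical filtering of step (3).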

3.2 Results 

A portion of the list of bigrams obtained from (4) is shown in Figure 1. We obtained 1695 
such bigrams after thresholding. 

Part of the output from (5) is shown in Figure 2. The first and the last numbers on
each line are the frequency of occurrence of the n-gram.

Parts of the output from (6) are shown in Figures 3, 4, and 5. 
Figure 1: Part of the bigram output, with glosses



3.3 Human evaluation 

For the first part of the precision evaluation, we relied on native speakers of Mandarin
and Cantonese. Many of the output words, especially domain-specific words and
collocations, were not found in the tokenizer dictionary. Most importantly, we are interested
in the percentage of output sequences that are legitimate words that can be used to
augment the tokenizer.

Four evaluators were instructed to mark whether each entry of the bigram and trigram
outputs was a word. The criterion they used was that a word must be able to stand by itself
and does not need context to have a meaning. To judge whether 4-gram, 5-gram, 6-gram
and m-gram outputs were words, the evaluators were told to consider an entry a word if it
was a sequence of shorter words that taken together held a conventional meaning, and did
not require any additional characters to complete its meaning.

Figure 2: Part of the CXtract output (columns: Freq, Collocation)

Figure 3: Part of the trigram, 4-gram and 5-gram output

Besides correct, the evaluators were given three other categories in which to place the n-grams.
Wrong means the entry had no meaning or an incomplete meaning. Unsure means the evaluator
was unsure. Note that the percentage in this category is not insignificant, indicating
the difficulty of defining Chinese word boundaries even for native speakers. Punctuation
means one or more of the characters was punctuation or ASCII markup.

Tables 1 and 2 show the results of the human evaluations. The Precision column gives
the percentage correct over total n-grams in that category.

Figure 4: Part of the 6-gram output, with glosses

We found some discrepancies between evaluators on the evaluation of correct and unsure
categories. Most of these cases arose when an n-gram included the possessive 的 (de) or
the copula 是 (shi). We also found some disagreement between evaluators from mainland
China and those from Hong Kong, particularly in recognizing literary idioms.

The average precision of the bigram output was 78.13%. The average trigram precision 
was 31.3%; 4-gram precision 36.75%; 5-gram precision 49.7%; 6-gram precision 55.2%; and 
the average m-gram precision was 54.09%. 

3.4 Dictionary/text evaluation 

The second part of the evaluation was to compare our output words with the words actually
present in the text. This gives the recall and precision of our output with respect to the
training corpus. Unfortunately, the training corpus is untokenized and too large to tokenize
by hand. We therefore estimated the words in the training corpus by passing it through an
automatic tokenizer based on the BDC dictionary (BDC 1992). Note that this dictionary's
entries were not derived from material related to our corpus. The tokens in the original
tokenized text were again sorted into unique bigrams, trigrams, 4-grams, 5-grams, 6-grams,
and m-grams with m > 6. Table 3 summarizes the precision, recall, and augmentation of
our output compared to the words in the text as determined by the automatic tokenizer.
Precision is the percentage of sequences found by CXtract that were actually words in the
text. Recall is the percentage of words in the text that were actually found by CXtract.
Augmentation is the percentage of new words found by CXtract that were judged to be
correct by human evaluators but were not in the dictionary.

Figure 5: Part of the m-gram output
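Precision and recall as defined in Section 3.4 reduce to set arithmetic over unique strings. The sketch below is illustrative only: the two word lists are made up, and augmentation is left out because the text does not spell out its denominator.

```python
def precision_recall(extracted, reference):
    """Precision and recall of extracted strings against reference word types."""
    extracted, reference = set(extracted), set(reference)
    hits = extracted & reference
    precision = len(hits) / len(extracted) if extracted else 0.0
    recall = len(hits) / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical data: candidate output vs. words found by an automatic tokenizer.
found = ["立法", "法局", "中英"]
in_text = ["立法", "中英", "香港", "政府"]
p, r = precision_recall(found, in_text)
print(f"precision={p:.2f} recall={r:.2f}")
```

Because the reference here comes from a dictionary-based tokenizer, a correct new word that the dictionary lacks counts against precision, which is why the human evaluation is reported alongside it.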

The recall is low because CXtract does not include n-grams with frequency lower than
8. However, we obtained 467 legitimate words or collocations to be added to the dictionary,
and the total augmentation is 5.73%. The overall precision is 59.3%.

Table 1: Human Evaluation of the Bigram Output Precision

Table 2: Human Evaluation of n-gram Output Precision

However, we believe the frequency threshold of 8 was too low and the 585K-character
corpus was too small. Most of the "garbage" output had low frequencies. The
precision rate can be improved by using a larger database and raising the threshold, as in
Experiment II.

Table 3: Precision, Recall and Augmentation of CXtract Output

In the following sections, we discuss the significance of the evaluated results. 

3.5 Bigrams are mostly words 

Using human evaluation, we found that 78% of the bigrams extracted by our tool were legitimate
words (as compared with 70.9% + 2.9% = 73.8% by evaluation against the automatic
tokenizer's output). Of all n-gram classes, the evaluators were least unsure of correctness
for bigrams, although quite a few classical Chinese terms were difficult for some of the
evaluators.

Since the corpus is an official transcript of formal debates, we find many terms from
classical Chinese which are not in the machine-readable dictionary, such as 謹此 (jin ci, "I
hereby").

Some of the bigrams are acronymic abbreviations of longer terms that are also domain-specific
and not generally found in a dictionary. For example, 中英 (zhong ying) is derived
from 中國, 英國 (zhong guo, ying guo), meaning Sino-British. This acronymic derivation
process is highly productive in Chinese.

3.6 The whole is greater than the sum of parts 

What is a legitimate word in Chinese? To the average Chinese reader, it has to do with
the vocabulary and usage patterns s/he acquired. It is sometimes disputable whether 立法局
(li fa ju, "Legislative Council") constitutes one word or two. But for the purposes of a
machine translation system, for example, the word 局 (ju) may be individually translated
not only into "Council" but also "Station", as in 警察局 (jing cha ju, "Police Station"). So
we might incorrectly get "Legislative Station". On the other hand, 立法局 (li fa ju) as a
single lexical item always maps to "Legislative Council".

Another example is 大部分 (da bu fen), which means "the majority". Our dictionary
omits this and the resulting tokenization is 大 (da, "big") and 部分 (bu fen, "part/partial").
It is clear that "majority" is a better translation than "big part".

3.7 Domain specific compounds 

Many of the n-grams for n > 3 found by CXtract are domain-specific compounds. For
example, due to the topics of discussion in the proceedings, "the year 1997" appears very 
frequently. 

Longer terms are frequently abbreviated into words of three or more characters. For
example, 中英雙方 (zhong ying shuang fang) means "bilateral Sino-British", and 中英聯合聲明
(zhong ying lian he sheng ming) means "Sino-British Joint Declaration". Various titles,
committee names, council names, projects, treaties, and joint declarations are also found
by our tool. Examples are shown in Figure 6.

Although many of the technical terms are a collocation of different words and sometimes 
acceptable word boundaries are found by the tokenizer, it is preferable that these terms be 
treated as single lexical items for purposes of machine translation, information retrieval, or 
spoken language processing. 

3.8 Idioms and cheng yu 

From n-gram output where n > 3, we find many idiomatic constructions that could be
tokenized into series of shorter words. In Chinese especially, there are many four-character
words which form a special idiomatic class known as 成語 (cheng yu). There are dictionaries
of cheng yu with all or nearly all entries being four-character idioms (e.g., Chen & Chen
1983). In the training corpus we used, we discovered new cheng yu that were invented to
describe a new concept. For example, 夾心階層 (jia xin jie ceng) means "sandwich class"
and is a metaphorical term for families who are not well off but with income just barely too
high to qualify for welfare assistance. Such invented terms are highly domain dependent,
as are the usage frequencies of established cheng yu.

3.9 Names 

Tokenizing Chinese names is a difficult task (Sproat et al. 1994) because Chinese names start
with a unigram or bigram family name, and are followed by a given name freely composed
of one or two characters. The given name usually holds some meaning, making it hard
to distinguish names from other words. We do not want to tokenize names into
separate characters. In a large corpus, names are often repeated. For example, in
our data, the names of some parliamentary members are extracted by our tool as separate
lexical items. Examples are shown in Figure 7. The last two characters of each example are
the person's title.

Figure 6: Some domain specific terms found by CXtract, with glosses



4 Experiment II: Tokenization Improvement 



Given the significant percentage of augmented words in Experiment I, we can see that many 
entries could be added to the dictionary used for automatic tokenization. 

In the next stage of our work, we used a larger portion of the corpus to obtain more
Chinese words and collocations, and with higher reliability. These items were converted
into dictionary format along with their frequency information.

To obtain a baseline performance, the tokenizer was tested with the original dictionary 
on two separate test sets. It was then tested with the statistically-augmented dictionary 
on the same test sets. Each of the tokenization outputs was evaluated by three human 
evaluators. 
Figure 7: Some names and titles found by CXtract

4.1 Procedure 

As training data we used about 2 million Chinese characters taken from the same HKUST 
corpus. This is about 4 times the size used in Experiment I. The tokenizer we used employs 
a maximal matching strategy with frequency preferences. 
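The paper does not detail how the tokenizer combines maximal matching with frequency preferences. One plausible reading, sketched here purely as an assumption, runs greedy longest-match in both directions and, when the two disagree, prefers the segmentation with fewer tokens and then the one with the higher summed frequency category. The lexicon entries and categories below are invented for the example.

```python
def forward_mm(text, lexicon, max_len=4):
    """Greedy longest match scanning left to right; unknown characters stand alone."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                tokens.append(piece)
                i += n
                break
    return tokens

def backward_mm(text, lexicon, max_len=4):
    """Greedy longest match scanning right to left."""
    tokens, j = [], len(text)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            piece = text[j - n:j]
            if n == 1 or piece in lexicon:
                tokens.append(piece)
                j -= n
                break
    return tokens[::-1]

def tokenize(text, lexicon, max_len=4):
    """Pick the direction with fewer tokens, then higher total frequency category."""
    def score(tokens):
        return (-len(tokens), sum(lexicon.get(t, 0) for t in tokens))
    return max(forward_mm(text, lexicon, max_len),
               backward_mm(text, lexicon, max_len), key=score)

# Hypothetical lexicons mapping word -> frequency category (1 to 5).
baseline = {"立法": 3, "部分": 3, "議員": 4}
augmented = dict(baseline, **{"立法局": 5, "大部分": 4})
print(tokenize("立法局大部分議員", baseline))
print(tokenize("立法局大部分議員", augmented))
```

With the smaller lexicon the sentence breaks into five pieces, while the augmented lexicon yields three, mirroring the 立法局 and 大部分 examples of Section 3.6.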

The original dictionary for the tokenizer holds 104,501 entries and lacks many of the 
domain-specific and regional words found in the corpus. 

From the first stage of CXtract, we obtained 4,196 unique adjacent bigrams. From
the second stage, we filtered out any CXtract output that occurred fewer than 11 times
and obtained 7,121 lexical candidates. Additional filtering constraints on high-frequency
characters were also imposed on all candidates. (A refined version of the linguistic
filtering is discussed in Wu & Fung (1994).) After all automatic filtering, we were left
with 5,554 new dictionary entries.

Since the original dictionary entries employed frequency categories of integer value from
1 to 5, we converted the frequency for each lexical item from the second stage output to
this same range by scaling. The adjacent bigrams from the first stage were assigned the
frequency number 1 (the lowest priority).

The converted CXtract outputs with frequency information were appended to the dictionary.
Some of the appended items were already in the dictionary. In this case, the
tokenization process uses the higher frequency between the original dictionary entry and
the CXtract-generated entry.
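The conversion and merge can be sketched as follows. The exact scaling formula is not given, so a linear min-max mapping onto the categories 1 to 5 is assumed here; the function names and sample counts are likewise hypothetical.

```python
def scale_to_category(freq, lo, hi, levels=5):
    """Map a raw count in [lo, hi] linearly onto integer categories 1..levels."""
    if hi == lo:
        return 1
    return 1 + round((freq - lo) * (levels - 1) / (hi - lo))

def augment(dictionary, bigrams, ngram_freqs):
    """Merge extracted items into a copy of the dictionary, keeping the higher category."""
    merged = dict(dictionary)
    for word in bigrams:                  # first-stage bigrams get category 1
        merged[word] = max(merged.get(word, 0), 1)
    if ngram_freqs:
        lo, hi = min(ngram_freqs.values()), max(ngram_freqs.values())
        for word, freq in ngram_freqs.items():
            category = scale_to_category(freq, lo, hi)
            merged[word] = max(merged.get(word, 0), category)
    return merged

# Hypothetical inputs: an existing dictionary and extracted items with raw counts.
original = {"立法": 4, "部分": 2}
merged = augment(original,
                 bigrams=["立法", "中英"],
                 ngram_freqs={"立法局": 411, "大部分": 211, "中英聯合聲明": 11})
print(merged)
```

Taking the maximum of the two categories matches the rule stated above for items already present in the dictionary.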

The total number of entries in the augmented dictionary is 110,055, an increase of 5.3% 
over the original dictionary size of 104,501. 

4.2 Results 

Two independent test sets of sentences were drawn from the corpus by random sampling
with replacement. TESTSET I contained 300 sentences, and TESTSET II contained 200
sentences. Both sets contain unretouched sentences with occasional noise and a large proportion
of unknown words, i.e., words not present in the original dictionary. (Sentences in
the corpus are heuristically determined.)

Each test set was tokenized twice. Baseline is the tokenization produced using the
original dictionary only. Augmented is the tokenization produced using the dictionary augmented
by CXtract.

Three human evaluators evaluated each of the test sets on both baseline and augmented
tokenizations. Two types of errors were counted: false joins and false breaks. A false join
occurs where there should have been a boundary between the characters, and a false break
occurs where the characters should have been linked. A conservative evaluation method
was used, where the evaluators were told not to mark errors when they felt that multiple
tokenization alternatives were acceptable.
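The two error types can be made concrete by comparing boundary positions. In the experiment the judgments were made by hand and leniently; the sketch below instead assumes a single reference segmentation, which is a simplification, and its function names are illustrative.

```python
def boundaries(tokens):
    """Character offsets of the internal boundaries implied by a token list."""
    cuts, pos = set(), 0
    for tok in tokens[:-1]:
        pos += len(tok)
        cuts.add(pos)
    return cuts

def count_errors(hypothesis, reference):
    """Return (false_joins, false_breaks) of hypothesis relative to reference."""
    if "".join(hypothesis) != "".join(reference):
        raise ValueError("segmentations cover different text")
    hyp, ref = boundaries(hypothesis), boundaries(reference)
    false_joins = len(ref - hyp)    # a needed boundary is missing
    false_breaks = len(hyp - ref)   # a boundary splits what should be linked
    return false_joins, false_breaks

# Hypothetical reference and two tokenizer outputs for the same sentence.
reference = ["立法局", "大部分", "議員"]
print(count_errors(["立法", "局", "大", "部分", "議員"], reference))
print(count_errors(["立法局大部分", "議員"], reference))
```

Over-segmentation of unknown words shows up as false breaks, the error type that dictionary augmentation is meant to reduce.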

The results are shown in Tables 4, 5, and 6. Baseline error is computed as the ratio 
of the number of errors in the baseline tokenization to the total number of tokens found. 
Augmented error is the ratio of the total number of errors in the augmented tokenization 
to the total number of tokens found. 

Table 4: Result of TESTSET I - 300 sentences

Evaluator   Baseline                                    Augmented
            # tokens  # errors  Error rate  Accuracy    # tokens  # errors  Error rate  Accuracy
(lost)      4194      1128      27%         73%         3893      731       19%         81%
F           4194      1145      27%         73%         3893      713       18%         82%
G           4194      1202      29%         71%         3893      702       18%         82%

Table 5: Result of TESTSET II - 200 sentences

Evaluator   Baseline                                    Augmented
            # tokens  # errors  Error rate  Accuracy    # tokens  # errors  Error rate  Accuracy
A           3083      737       24%         76%         2890      375       13%         87%
H           3083      489       16%         84%         2890      322       11%         89%
I           3083      545       18%         82%         2890      339       12%         88%

Table 6: Average accuracy and error rate over all evaluators and test sets

Experiment   Total # tokens   Average error   Error rate   Accuracy
Baseline     7277             1749            24%          76%
Augmented    6783             1061            16%          84%

Our baseline rates demonstrate how sensitive tokenization performance is to dictionary
coverage. The accuracy rate of 76% is extremely low compared with other reported
percentages, which generally fall in the 90s (Chiang et al. 1992; Lin et al. 1992;
Chang & Chen 1993; Lin et al. 1993). We believe that this reflects the tailoring of dictionaries
to the particular domains and genres on which tokenization accuracies are reported. Our
experiment, on the other hand, reflects a more realistic situation where the dictionary and
text are derived from completely independent sources, leading to a very high proportion of
missing words. Under these realistic conditions, CXtract has shown enormous utility. The
error reduction rate of 33% was far beyond our initial expectations.
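The summary figures can be re-derived from the per-evaluator counts in Tables 4 and 5. The sketch assumes that "Average error" in Table 6 is the sum, over the two test sets, of the mean error count across the three evaluators; that reading reproduces the published totals.

```python
# (number of tokens, error counts from the three evaluators) per test set
baseline = {"I": (4194, [1128, 1145, 1202]), "II": (3083, [737, 489, 545])}
augmented = {"I": (3893, [731, 713, 702]), "II": (2890, [375, 322, 339])}

def summarize(results):
    """Return total tokens, summed per-set mean errors, and the error rate."""
    total_tokens = sum(tokens for tokens, _ in results.values())
    avg_errors = sum(sum(errs) / len(errs) for _, errs in results.values())
    return total_tokens, avg_errors, avg_errors / total_tokens

b_tokens, b_err, b_rate = summarize(baseline)
a_tokens, a_err, a_rate = summarize(augmented)

# Error reduction computed from the rounded rates reported in Table 6.
reduction_rounded = (round(b_rate, 2) - round(a_rate, 2)) / round(b_rate, 2)
# The same quantity from the unrounded rates.
reduction_exact = (b_rate - a_rate) / b_rate

print(b_tokens, round(b_err), f"{b_rate:.1%}")
print(a_tokens, round(a_err), f"{a_rate:.1%}")
print(f"{reduction_rounded:.1%} {reduction_exact:.1%}")
```

The 33% figure follows from the rounded rates, (24% - 16%) / 24%; the unrounded rates give a reduction of roughly 35%.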

5 Conclusion 

We have presented a statistical tool, CXtract, that identifies words without supervision on 
untagged Chinese text. Many domain-specific and regional words, names, titles, compounds, 
and idioms that were not found in our machine-readable dictionary were automatically 
extracted by our tool. These lexical items were used to augment the dictionary and to 
improve tokenization. 

The output was evaluated both by human evaluators and by comparison against dictionary
entries. We have also shown that the output of our tool helped improve a Chinese
tokenizer's accuracy from 76% to 84%, an error reduction rate of 33%.